Qualitative Analysis of Partially-observable Markov 
Decision Processes 



Krishnendu Chatterjee¹, Laurent Doyen², and Thomas A. Henzinger¹
¹ IST Austria (Institute of Science and Technology Austria)



Abstract. We study observation-based strategies for partially-observable 
Markov decision processes (POMDPs) with parity objectives. An observation- 
based strategy relies on partial information about the history of a play, namely, 
on the past sequence of observations. We consider qualitative analysis prob- 
lems: given a POMDP with a parity objective, decide whether there exists an 
observation-based strategy to achieve the objective with probability 1 (almost- 
sure winning), or with positive probability (positive winning). Our main results 
are twofold. First, we present a complete picture of the computational complex- 
ity of the qualitative analysis problem for POMDPs with parity objectives and 
its subclasses: safety, reachability, Büchi, and coBüchi objectives. We establish
several upper and lower bounds that were not known in the literature, and present 
efficient and symbolic algorithms for the decidable subclasses. Second, we give, 
for the first time, optimal bounds (matching upper and lower bounds) for the 
memory required by pure and randomized observation-based strategies for all 
classes of objectives. 



1 Introduction 

Markov decision processes. A Markov decision process (MDP) is a model for systems 
that exhibit both probabilistic and nondeterministic behavior. MDPs have been used to 
model and solve control problems for stochastic systems: there, nondeterminism rep- 
resents the freedom of the controller to choose a control action, while the probabilistic 
component of the behavior describes the system response to control actions. MDPs have 
also been adopted as models for concurrent probabilistic systems, probabilistic systems 
operating in open environments [23], and under-specified probabilistic systems [6]. 

System specifications. The specification describes the set of desired behaviors of the 
system, and is typically an $\omega$-regular set of paths. Parity objectives are a canonical
way to define such specifications in MDPs. They include reachability, safety, Büchi
and coBüchi objectives as special cases. Thus MDPs with parity objectives provide
the theoretical framework to study problems such as the verification and the control of 
stochastic systems. 

Perfect vs. partial observations. Most results about MDPs make the hypothesis of 
perfect observation. In this setting, the controller always knows, while interacting with 
the system (or MDP), the exact state of the MDP. In practice, this hypothesis is often 
unrealistic. For example, in the control of multiple processes, each process has only
access to the public variables of the other processes, but not to their private variables.
In the control of hybrid systems [7, 13], or in automated planning [19], the controller 
usually has noisy information about the state of the systems due to finite-precision sen- 
sors. In such applications, MDPs with partial observation (POMDPs) provide a more 
appropriate model. 

Qualitative and quantitative analysis. Given an MDP with parity objective, the qual- 
itative analysis asks for the computation of the set of almost-sure winning states (resp., 
positive winning states) in which the controller can achieve the parity objective with 
probability 1 (resp., positive probability); the more general quantitative analysis asks 
for the computation at each state of the maximal probability with which the controller 
can satisfy the parity objective. The analysis of POMDPs is considerably more com- 
plicated than the analysis of MDPs. First, the decision problems for POMDPs usu- 
ally lie in higher complexity classes than their perfect-observation counterparts: for 
example, the quantitative analysis of POMDPs with reachability and safety objectives 
is undecidable [21], whereas for MDPs with perfect observation, this question can be 
solved in polynomial time [11,10]. Second, in the context of POMDPs, witness winning 
strategies for the controller need memory even for the simple objectives of safety and 
reachability. This is again in contrast to the perfect-observation case, where memoryless 
strategies suffice for all parity objectives. Since the quantitative analysis of POMDPs 
is undecidable (even for computing approximations of the maximal probabilities [19]), 
we study the qualitative analysis of POMDPs with parity objective and its subclasses. 

Contribution. For the qualitative analysis of POMDPs, the following results are 
known: (a) the problems of deciding if a state is almost-sure winning for reachability 
and Büchi objectives can be solved in EXPTIME [1]; (b) the problems for almost-sure
winning for coBüchi objectives and positive winning for Büchi objectives are
undecidable [1, 14]; and (c) the EXPTIME-completeness of almost-sure winning for safety
objectives follows from the results on games with partial observation [9,5]. Our new 
contributions are as follows: 

1. First, we show that (a) positive winning for reachability objectives is
NLOGSPACE-complete; and (b) almost-sure winning for reachability and Büchi
objectives, and positive winning for safety and coBüchi objectives are EXPTIME-hard³.
We also present a new proof that positive winning for safety and coBüchi objectives
can be solved in EXPTIME⁴. It follows that almost-sure winning for reachability
and Büchi, and positive winning for safety and coBüchi, are EXPTIME-complete.
This completes the picture for the complexity of the qualitative analysis
for POMDPs with parity objectives. Moreover, our new proofs of the EXPTIME upper
bounds yield efficient and symbolic algorithms to solve positive winning for
safety and coBüchi objectives in POMDPs.

2. Second, we present a complete characterization of the amount of memory required
by pure (deterministic) and randomized strategies for the qualitative analysis of
POMDPs. For the first time, we present optimal memory bounds (matching upper
and lower bounds) for pure and randomized strategies: we show that (a) for positive
winning of reachability objectives, randomized memoryless strategies suffice,
while for pure strategies linear memory is necessary and sufficient; (b) for almost-sure
winning of safety, reachability, and Büchi objectives, and for positive winning
of safety and coBüchi objectives, exponential memory is necessary and sufficient
for both pure and randomized strategies.

³ A very brief (two-line) proof of EXPTIME-hardness is sketched in [12] (see the discussion
before Theorem 5 for more details).

⁴ A different proof that positive safety can be solved in EXPTIME is given in [15] (see the
discussion after Theorem 2 for a comparison).

Related work. Though MDPs have been widely studied under the hypothesis of perfect
observations, there are a few works that consider POMDPs, e.g., [20, 18] for several
finite-horizon quantitative objectives. The results of [1] show the upper bounds for
almost-sure winning for reachability and Büchi objectives, and the work of [8] considers
a subclass of POMDPs with Büchi objectives and presents a PSPACE upper bound
for the subclass. The undecidability of almost-sure winning for coBüchi and positive
winning for Büchi objectives is established by [1, 14]. We present a solution to the
remaining problems related to the qualitative analysis of POMDPs with parity objectives,
and complete the picture. Partial information has been studied in the context of two- 
player games [22,9], a model that is incomparable to MDPs, though some techniques 
(like the subset construction) can be adapted to the context of POMDPs. More general 
models of stochastic games with partial information have been studied in [3, 15], and lie 
in higher complexity classes. For example, a result of [3] shows that the decision prob- 
lem for positive winning of safety objectives is 2EXPTIME-complete in the general 
model, while for POMDPs, we show that the same problem is EXPTIME-complete. 

2 Definitions 

A probability distribution on a finite set $A$ is a function $\kappa : A \to [0,1]$ such that
$\sum_{a \in A} \kappa(a) = 1$. The support of $\kappa$ is the set $\mathrm{Supp}(\kappa) = \{a \in A \mid \kappa(a) > 0\}$. We
denote by $\mathcal{D}(A)$ the set of probability distributions on $A$.

Games and MDPs. A two-player game structure or a Markov decision process (MDP)
(of partial observation) is a tuple $G = (L, \Sigma, \delta, \mathcal{O})$, where $L$ is a finite set of states,
$\Sigma$ is a finite set of actions, $\mathcal{O} \subseteq 2^L$ is a set of observations that partition⁵ the state
space $L$. We denote by $\mathrm{obs}(\ell)$ the unique observation $o \in \mathcal{O}$ such that $\ell \in o$. In the
case of games, $\delta \subseteq L \times \Sigma \times L$ is a set of labeled transitions; in the case of MDPs,
$\delta : L \times \Sigma \to \mathcal{D}(L)$ is a probabilistic transition function. For games, we require that for
all $\ell \in L$ and all $\sigma \in \Sigma$, there exists $\ell' \in L$ such that $(\ell, \sigma, \ell') \in \delta$. We refer to a game
of partial observation as a POG and to an MDP of partial observation as a POMDP.
We say that $G$ is a game or MDP of perfect observation if $\mathcal{O} = \{\{\ell\} \mid \ell \in L\}$. For
$\sigma \in \Sigma$ and $s \subseteq L$, define $\mathrm{Post}^G_\sigma(s) = \{\ell' \in L \mid \exists \ell \in s : (\ell, \sigma, \ell') \in \delta\}$ when $G$ is a
game, and $\mathrm{Post}^G_\sigma(s) = \{\ell' \in L \mid \exists \ell \in s : \delta(\ell, \sigma)(\ell') > 0\}$ when $G$ is an MDP.
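As an illustration of these definitions, the following is a minimal sketch of one possible encoding of a POMDP, with the functions obs and Post. The toy states, the single action "a", and the dictionary layout are assumptions made for this example only; they do not come from the paper.

```python
# Hypothetical toy POMDP (not from the paper).
# delta[(state, action)] maps each successor state to its probability.
delta = {
    ("l0", "a"): {"l1": 0.5, "l2": 0.5},
    ("l1", "a"): {"l1": 1.0},
    ("l2", "a"): {"l0": 1.0},
}

# Observations partition the state space.
observations = [frozenset({"l0"}), frozenset({"l1", "l2"})]


def obs(state):
    """Return the unique observation that contains `state`."""
    for o in observations:
        if state in o:
            return o
    raise ValueError(f"state {state!r} is in no observation")


def post(action, states):
    """Post_sigma(s) for an MDP: successors reached with positive probability."""
    return {
        succ
        for state in states
        for succ, prob in delta[(state, action)].items()
        if prob > 0
    }
```

With this encoding, post("a", {"l0"}) is the set containing l1 and l2, and both of these states share the same observation.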

Plays. Games are played in rounds in which Player 1 chooses an action in $\Sigma$, and
Player 2 resolves nondeterminism by choosing the successor state; in MDPs the successor
state is chosen according to the probabilistic transition function. A play in $G$ is
an infinite sequence $\pi = \ell_0 \sigma_0 \ell_1 \ldots \sigma_{n-1} \ell_n \sigma_n \ldots$ such that $\ell_{i+1} \in \mathrm{Post}^G_{\sigma_i}(\{\ell_i\})$ for all
$i \geq 0$. The infinite sequence $\mathrm{obs}(\pi) = \mathrm{obs}(\ell_0) \sigma_0 \mathrm{obs}(\ell_1) \ldots \sigma_{n-1} \mathrm{obs}(\ell_n) \sigma_n \ldots$ is the
observation of $\pi$.

The set of infinite plays in $G$ is denoted $\mathrm{Plays}(G)$, and the set of finite prefixes
$\ell_0 \sigma_0 \ldots \sigma_{n-1} \ell_n$ of plays is denoted $\mathrm{Prefs}(G)$. A state $\ell \in L$ is reachable in $G$ if there
exists a prefix $\rho \in \mathrm{Prefs}(G)$ such that $\mathrm{Last}(\rho) = \ell$, where $\mathrm{Last}(\rho)$ is the last state of $\rho$.

⁵ A slightly more general model with overlapping observations can be reduced in polynomial
time to partitioning observations [9].

Strategies. A pure strategy in $G$ for Player 1 is a function $\alpha : \mathrm{Prefs}(G) \to \Sigma$. A
randomized strategy in $G$ for Player 1 is a function $\alpha : \mathrm{Prefs}(G) \to \mathcal{D}(\Sigma)$. A (pure
or randomized) strategy $\alpha$ for Player 1 is observation-based if for all prefixes $\rho, \rho' \in
\mathrm{Prefs}(G)$, if $\mathrm{obs}(\rho) = \mathrm{obs}(\rho')$, then $\alpha(\rho) = \alpha(\rho')$. In the sequel, we are interested
in the existence of observation-based strategies for Player 1. A pure strategy in $G$ for
Player 2 is a function $\beta : \mathrm{Prefs}(G) \times \Sigma \to L$ such that for all $\rho \in \mathrm{Prefs}(G)$ and all
$\sigma \in \Sigma$, we have $(\mathrm{Last}(\rho), \sigma, \beta(\rho, \sigma)) \in \delta$. A randomized strategy in $G$ for Player 2 is
a function $\beta : \mathrm{Prefs}(G) \times \Sigma \to \mathcal{D}(L)$ such that for all $\rho \in \mathrm{Prefs}(G)$, all $\sigma \in \Sigma$, and
all $\ell \in \mathrm{Supp}(\beta(\rho, \sigma))$, we have $(\mathrm{Last}(\rho), \sigma, \ell) \in \delta$. We denote by $\mathcal{A}_G$, $\mathcal{A}_G^O$, and $\mathcal{B}_G$ the
set of all Player-1 strategies, the set of all observation-based Player-1 strategies, and the
set of all Player-2 strategies in $G$, respectively.

Memory requirement of strategies. An equivalent definition of strategies is as follows.
Let $\mathrm{Mem}$ be a set called memory. An observation-based strategy with memory can be
described by two functions, a memory-update function $\alpha_u : \mathrm{Mem} \times \mathcal{O} \times \Sigma \to \mathrm{Mem}$ that
given the current memory, observation and the action updates the memory, and a
next-action function $\alpha_n : \mathrm{Mem} \times \mathcal{O} \to \mathcal{D}(\Sigma)$ that given the current memory and current
observation specifies the probability distribution⁶ of the next action, respectively. A
strategy is finite-memory if the memory $\mathrm{Mem}$ is finite and the size of a finite-memory
strategy $\alpha$ is the size $|\mathrm{Mem}|$ of its memory. A strategy is memoryless if $|\mathrm{Mem}| = 1$. The
memoryless strategies do not depend on the history of a play, but only on the current
state. Memoryless strategies for Player 1 can be viewed as functions $\alpha : \mathcal{O} \to \mathcal{D}(\Sigma)$.
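The two-function view of a strategy can be sketched as a small class. This is only an illustration under assumed names: FiniteMemoryStrategy, act, and the example strategy alternate are hypothetical and do not appear in the paper. The next-action function returns a dictionary from actions to probabilities, and a pure strategy is the special case where one action has probability 1.

```python
import random


class FiniteMemoryStrategy:
    """Observation-based strategy given by a memory-update and a next-action function."""

    def __init__(self, initial_memory, update, next_action):
        self.memory = initial_memory
        self.update = update            # (memory, observation, action) -> memory
        self.next_action = next_action  # (memory, observation) -> {action: probability}

    def act(self, observation, rng=random):
        """Sample the next action, then update the memory with the chosen action."""
        dist = self.next_action(self.memory, observation)
        actions = list(dist)
        weights = [dist[a] for a in actions]
        action = rng.choices(actions, weights=weights)[0]
        self.memory = self.update(self.memory, observation, action)
        return action


# Example (hypothetical): a pure strategy with Mem = {0, 1} that alternates
# between actions "a" and "b" regardless of the observation.
alternate = FiniteMemoryStrategy(
    initial_memory=0,
    update=lambda mem, observation, action: 1 - mem,
    next_action=lambda mem, observation: {"a": 1.0} if mem == 0 else {"b": 1.0},
)
played = [alternate.act("o1") for _ in range(4)]
```

A memoryless strategy corresponds to a single memory value, for instance an update function that always returns 0.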

Objectives. An objective for $G$ is a set of infinite sequences of states and actions,
that is, $\varphi \subseteq (L \times \Sigma)^\omega$. We consider objectives that are Borel measurable, i.e., Borel sets in
the Cantor topology on $(L \times \Sigma)^\omega$ [17]. We specifically consider reachability, safety,
Büchi, coBüchi, and parity objectives, all of them being Borel measurable. The parity
objectives are a canonical form to express all $\omega$-regular objectives [24]. For a play $\pi =
\ell_0 \sigma_0 \ell_1 \ldots$, we denote by $\mathrm{Inf}(\pi) = \{\ell \in L \mid \ell = \ell_i \text{ for infinitely many } i\text{'s}\}$ the set of
states that appear infinitely often in $\pi$.

- Reachability and safety objectives. Given a set $T \subseteq L$ of target states, the
reachability objective $\mathrm{Reach}(T) = \{\ell_0 \sigma_0 \ell_1 \sigma_1 \ldots \in \mathrm{Plays}(G) \mid \exists k \geq 0 : \ell_k \in T\}$
requires that a target state in $T$ be visited at least once. Dually, the safety objective
$\mathrm{Safe}(T) = \{\ell_0 \sigma_0 \ell_1 \sigma_1 \ldots \in \mathrm{Plays}(G) \mid \forall k \geq 0 : \ell_k \in T\}$ requires that only
states in $T$ be visited; the objective $\mathrm{Until}(T_1, T_2) = \{\ell_0 \sigma_0 \ell_1 \sigma_1 \ldots \in \mathrm{Plays}(G) \mid
\exists k \geq 0 : \ell_k \in T_2 \wedge \forall j < k : \ell_j \in T_1\}$ requires that only states in $T_1$ be visited
before a state in $T_2$ is visited;

- Büchi and coBüchi objectives. The Büchi objective $\mathrm{Büchi}(T) = \{\pi \mid \mathrm{Inf}(\pi) \cap T \neq
\emptyset\}$ requires that a state in $T$ be visited infinitely often. Dually, the coBüchi objective
$\mathrm{coBüchi}(T) = \{\pi \mid \mathrm{Inf}(\pi) \subseteq T\}$ requires that only states in $T$ be visited infinitely
often; and

- Parity objectives. For $d \in \mathbb{N}$, let $p : L \to \{0, 1, \ldots, d\}$ be a priority function that
maps each state to a nonnegative integer priority. The parity objective $\mathrm{Parity}(p) =
\{\pi \mid \min\{p(\ell) \mid \ell \in \mathrm{Inf}(\pi)\} \text{ is even}\}$ requires that the smallest priority that
appears infinitely often be even.

⁶ For a pure strategy, the next-action function specifies a single action rather than a probability
distribution.

Note that the objectives $\mathrm{Büchi}(T)$ and $\mathrm{coBüchi}(T)$ are special cases of parity objectives
defined by respective priority functions $p_1$, $p_2$ such that $p_1(\ell) = 0$ and $p_2(\ell) = 2$
if $\ell \in T$, and $p_1(\ell) = p_2(\ell) = 1$ otherwise. An objective is visible if it depends only
on the observations; formally, $\varphi$ is visible if, whenever $\pi \in \varphi$ and $\mathrm{obs}(\pi) = \mathrm{obs}(\pi')$,
then $\pi' \in \varphi$. In this work, all our upper bound results are for the general parity
objectives (not necessarily visible), and all the lower bound results for POMDPs are for
the special case of visible objectives (and hence the lower bounds also hold for general
objectives).
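The encoding of Büchi and coBüchi objectives as parity objectives can be checked on ultimately periodic plays, where the set Inf of states visited infinitely often is simply the set of states on the repeated cycle. The sketch below assumes this restriction to lasso-shaped plays; the function names and the three-state example are illustrative only.

```python
from itertools import combinations


def parity(priority, inf_states):
    """Parity condition: the least priority seen infinitely often is even."""
    return min(priority[s] for s in inf_states) % 2 == 0


def buchi(target, inf_states):
    """Buchi(T): some state of T is visited infinitely often."""
    return bool(set(inf_states) & set(target))


def cobuchi(target, inf_states):
    """coBuchi(T): only states of T are visited infinitely often."""
    return set(inf_states) <= set(target)


def buchi_priority(states, target):
    """Priority function p_1: 0 on T, 1 elsewhere."""
    return {s: 0 if s in target else 1 for s in states}


def cobuchi_priority(states, target):
    """Priority function p_2: 2 on T, 1 elsewhere."""
    return {s: 2 if s in target else 1 for s in states}


# Hypothetical example: three states, target set T = {"x"}.
states = ["x", "y", "z"]
target = {"x"}
p1 = buchi_priority(states, target)
p2 = cobuchi_priority(states, target)

# Every nonempty set of states is a possible value of Inf for some play shape.
all_inf_sets = [
    set(c) for k in range(1, len(states) + 1) for c in combinations(states, k)
]
agree = all(
    parity(p1, inf) == buchi(target, inf) and parity(p2, inf) == cobuchi(target, inf)
    for inf in all_inf_sets
)
```

In this example the two encodings agree with the direct definitions on every candidate set of infinitely visited states.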

Almost-sure and positive winning. An event is a measurable set of plays, and given
strategies $\alpha$ and $\beta$ for the two players (resp., a strategy $\alpha$ for Player 1 in MDPs), the
probabilities of events are uniquely defined [25]. For a Borel objective $\varphi$, we denote by
$\Pr_\ell^{\alpha,\beta}(\varphi)$ (resp., $\Pr_\ell^{\alpha}(\varphi)$ for MDPs) the probability that $\varphi$ is satisfied from the starting
state $\ell$ given the strategies $\alpha$ and $\beta$ (resp., given the strategy $\alpha$). Given a game $G$ and
a state $\ell$, a strategy $\alpha$ for Player 1 is almost-sure winning (resp., positive winning)
for the objective $\varphi$ from $\ell$ if for all randomized strategies $\beta$ for Player 2, we have
$\Pr_\ell^{\alpha,\beta}(\varphi) = 1$ (resp., $\Pr_\ell^{\alpha,\beta}(\varphi) > 0$). Given an MDP $G$ and a state $\ell$, a strategy $\alpha$ for
Player 1 is almost-sure winning (resp., positive winning) for the objective $\varphi$ from $\ell$ if we
have $\Pr_\ell^{\alpha}(\varphi) = 1$ (resp., $\Pr_\ell^{\alpha}(\varphi) > 0$). We also say that state $\ell$ is almost-sure winning,
or positive winning for $\varphi$, respectively. We are interested in the problems of deciding
the existence of an observation-based strategy for Player 1 that is almost-sure winning
(resp., positive winning) from a given state $\ell$.

3 Upper Bounds for the Qualitative Analysis of POMDPs 

In this section, we present upper bounds for the qualitative analysis of POMDPs. We 
first describe the known results. For qualitative analysis of MDPs, polynomial-time upper
bounds are known for all parity objectives [11, 10]. It follows from the results of [9,
1] that the decision problems for almost-sure winning for POMDPs with reachability,
safety, and Büchi objectives can be solved in EXPTIME. It also follows from the results
of [1] that the decision problems for almost-sure winning with coBüchi objectives and
for positive winning with Büchi objectives are undecidable if the strategies are restricted
to be pure, and the results of [14] show that the problems remain undecidable even if
randomized strategies are considered. In this section, we complete the results on upper
bounds for the qualitative analysis of POMDPs: we present complexity upper bounds
for the decision problems of positive winning with reachability, safety and coBüchi
objectives. The following result for reachability objectives is simple, and for a complete
and systematic analysis we present the proof. 



Theorem 1. Given a POMDP $G$ with a reachability objective and a starting state $\ell$,
the problem of deciding whether there is a positive winning strategy from $\ell$ in $G$ is
NLOGSPACE-complete.

Proof. The NLOGSPACE-completeness result for positive reachability for MDPs follows
from reductions to and from graph reachability.

Reduction to graph reachability. Given a POMDP $G = (L, \Sigma, \delta, \mathcal{O})$ and a set of target
states $T \subseteq L$, consider the graph $\overline{G} = (L, E)$ where $(\ell, \ell') \in E$ if there exists an
action $\sigma \in \Sigma$ such that $\delta(\ell, \sigma)(\ell') > 0$. Let $\ell$ be a starting state, then the following
assertions hold: (a) if there is a path $\pi$ in $\overline{G}$ from $\ell$ to a state $t \in T$, then the randomized
memoryless strategy for Player 1 in $G$ that plays all actions uniformly at random ensures
that the path $\pi$ is executed in $G$ with positive probability (i.e., ensures positive winning
for $\mathrm{Reach}(T)$ in $G$ from $\ell$); and (b) if there is no path in $\overline{G}$ to reach $T$ from $\ell$, then
there is no strategy (and hence no observation-based strategy) for Player 1 in $G$ to
achieve $\mathrm{Reach}(T)$ with positive probability. This shows that positive winning in POMDPs
can be decided in NLOGSPACE. Graphs are a special case of POMDPs and hence graph
reachability can be reduced to reachability with positive probability in POMDPs,
therefore the problem is NLOGSPACE-complete. ■
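The reduction to graph reachability can be sketched as a plain breadth-first search on the graph that has an edge from one state to another whenever some action leads there with positive probability. Note that this sketch is a linear-time search, not the nondeterministic logarithmic-space procedure behind the complexity bound; the dictionary encoding of the transition function and the toy example are assumptions of this illustration.

```python
from collections import deque


def positive_reach(delta, start, targets):
    """Decide positive winning for Reach(targets) from `start`.

    delta[(state, action)] maps successor states to probabilities.
    Observations play no role: only the underlying graph matters.
    """
    # Build the edge relation of the graph: state -> set of possible successors.
    edges = {}
    for (state, _action), dist in delta.items():
        edges.setdefault(state, set()).update(
            succ for succ, prob in dist.items() if prob > 0
        )
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state in targets:
            return True
        for succ in edges.get(state, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False


# Hypothetical example with two actions "a" and "b".
delta = {
    (0, "a"): {0: 1.0},
    (0, "b"): {1: 0.1, 0: 0.9},
    (1, "a"): {2: 1.0},
    (1, "b"): {1: 1.0},
    (2, "a"): {2: 1.0},
    (2, "b"): {2: 1.0},
    (3, "a"): {3: 1.0},
    (3, "b"): {3: 1.0},
}
```

In this example state 2 is reachable from state 0 with positive probability (via action b and then a), whereas state 3 is not.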

Positive winning for safety and coBüchi objectives. We now show that the decision
problem for positive winning with safety and coBüchi objectives for POMDPs can
be solved in EXPTIME. We first show with an example that the simple approach of 
reduction to a perfect-information MDP by subset construction and solving the perfect 
information MDP with safety objective for positive winning does not yield the desired 
result. 

Example 1. Consider the POMDP shown in Fig. 1: in every state there exists only one
action (which we omit for simplicity). In other words, we have a partially observable
Markov chain. States 0, 1, and 2 are safe states and form observation $o_1$, while state 3
forms observation $o_2$ (which is not in the safe set). The state 0 in $G$ is positive winning
for the safety objective as with positive probability the state 2 is reached and then the
state 2 is visited forever. In contrast, consider the perfect information MDP $G^K$ obtained
from $G$ by subset construction (in this case $G^K$ is a Markov chain). In $G^K$ from the state
$\{1, 2\}$, the possible successors are 1, 2, and 3, and since the observations are different
at 1 and 2, as compared to 3, the successors of $\{1, 2\}$ are $\{1, 2\}$ and $\{3\}$. The reachable
set of states in $G^K$ from the state $\{0\}$ is shown in Fig. 1. In $G^K$, the state $\{0\}$ is not
positive winning: the state $\{3\}$ is the only recurrent state reachable from $\{0\}$ and hence
from the state $\{0\}$, with probability 1, the state $\{3\}$ is reached and $\{3\}$ is not a safe
state. Note that all this holds regardless of the precise value of nonzero probabilities.



Our result for positive safety and coBüchi objectives is based on the computation of
almost-sure winning states for safety objectives, and on the following lemma.

Lemma 1. Let $G = (L, \Sigma, \delta, \mathcal{O})$ be a POMDP and let $T \subseteq L$ be the set of target
states. If Player 1 has an observation-based strategy in $G$ to satisfy $\mathrm{Safe}(T)$ with positive
probability from some state $\ell$, then there exists a state $\ell'$ such that (a) Player 1 has
an observation-based strategy in $G$ to satisfy $\mathrm{Until}(T, \{\ell'\})$ with positive probability
from $\ell$, and (b) Player 1 has an observation-based almost-sure winning strategy in $G$
for $\mathrm{Safe}(T)$ from $\ell'$.

Proof. We assume without loss of generality that the non-safe states in $G$ are absorbing.
Assume that Player 1 has an observation-based positive winning strategy $\alpha$ in $G$ for
the objective $\mathrm{Safe}(T)$ from $\ell$, and towards a contradiction assume that for all states $\ell'$
reachable from $\ell$ with positive probability using $\alpha$ in $G$, Player 1 has no observation-based
almost-sure winning strategy for $\mathrm{Safe}(T)$ from $\ell'$. A standard argument shows
that from every such state $\ell'$, regardless of the observation-based strategy of Player 1,
the probability to stay safe within the next $n$ steps is at most $1 - \eta^n$ where $\eta$ is the least
non-zero probability in $G$ and $n$ is the number of states in $G$. Since under strategy $\alpha$,
every reachable state has this property, the probability to stay safe within $k \cdot n$ steps
is at most $(1 - \eta^n)^k$. This value tends to 0 when $k \to \infty$, therefore the probability to
stay safe using $\alpha$ from $\ell$ is 0, a contradiction. Hence, there exists a state $\ell'$ which is
almost-sure winning for Player 1 (using observation-based strategy $\alpha$) and such that $\ell'$
is reached with positive probability from $\ell$ while staying in $T$ (again using $\alpha$). ■

By Lemma 1, positive winning states can be computed as the set of states from 
which Player 1 can force with positive probability to reach an almost-sure winning 
state while visiting only safe states. Almost-sure winning states can be computed using 
the following subset construction. 

Given a POMDP $G = (L, \Sigma, \delta, \mathcal{O})$ and a set $T \subseteq L$ of states, the knowledge-based
subset construction of $G$ is the game of perfect observation

$G^K = (\mathcal{L}, \Sigma, \delta^K)$,

where $\mathcal{L} = 2^L \setminus \{\emptyset\}$, and for all $s_1, s_2 \in \mathcal{L}$ (in particular $s_2 \neq \emptyset$) and $\sigma \in \Sigma$, we have
$(s_1, \sigma, s_2) \in \delta^K$ iff there exists an observation $o \in \mathcal{O}$ such that either $s_2 = \mathrm{Post}^G_\sigma(s_1) \cap
o \cap T$, or $s_2 = (\mathrm{Post}^G_\sigma(s_1) \cap o) \setminus T$. We refer to states in $G^K$ as cells. The following result
is established using standard techniques (see e.g., Lemma 3.2 and Lemma 3.3 in [9])
and the fact that almost-sure winning and sure winning (sure winning is winning with
certainty as compared to winning with probability 1 for almost-sure winning, see [9]
for details of sure winning) coincide for safety objectives.

Lemma 2. Let $G = (L, \Sigma, \delta, \mathcal{O})$ be a POMDP and $T \subseteq L$ a set of target states. Let
$G^K$ be the subset construction and $F_T = \{s \in \mathcal{L} \mid s \subseteq T\}$ the set of safe cells. Player 1 has an
almost-sure winning observation-based strategy in $G$ for $\mathrm{Safe}(T)$ from $\ell$ if and only if
Player 1 has an almost-sure winning strategy in $G^K$ for $\mathrm{Safe}(F_T)$ from cell $\{\ell\}$.

Remark 1. Lemma 2 also holds if we replace almost-sure winning by sure winning, 
since for safety objectives almost-sure and sure winning coincide. 

Theorem 2. Given a POMDP $G$ with a safety objective and a starting state $\ell$, the
problem of deciding whether there exists a positive winning observation-based strategy
from $\ell$ can be solved in EXPTIME.

Proof. The almost-sure winning states in $G$ for a safety objective (with observation-based
strategy) can be computed in exponential time using the subset construction (by
Lemma 2 and [9]). Then, given the set $W$ of cells that are almost-sure winning in $G^K$,
let $T_W = \{\ell \in s \mid s \in W\}$ be the almost-sure winning states in $G$. We can compute
the states from which Player 1 can force $T_W$ to be reached with positive probability
while staying within the safe states using standard graph analysis algorithms, as in
Lemma 1. Clearly such states are positive winning in $G$, and by Lemma 1 all positive
winning states in $G$ are obtained in this way. This gives an EXPTIME algorithm to
decide from which states there exists a positive winning observation-based strategy for
safety objectives. ■
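The two phases of this proof can be sketched as follows: a greatest fixpoint over the safe cells of the subset construction, then a graph search towards the almost-sure winning states inside the safe set. This is a naive explicit enumeration of all cells, not the antichain-based symbolic algorithm mentioned in the text. The toy chain is only in the spirit of Example 1; since Fig. 1 is not reproduced here, its transitions and probabilities are assumptions.

```python
from itertools import combinations


def post(delta, action, cell):
    """States reachable with positive probability from `cell` under `action`."""
    return {t for l in cell for t, p in delta[(l, action)].items() if p > 0}


def successors(delta, observations, safe, action, cell):
    """Successor cells of `cell` under `action` in the subset construction."""
    reach = post(delta, action, cell)
    cells = set()
    for o in observations:
        for part in (reach & o & safe, (reach & o) - safe):
            if part:
                cells.add(frozenset(part))
    return cells


def almost_sure_safe_states(actions, delta, observations, safe):
    """States that belong to some winning cell for the safety game on cells."""
    safe = frozenset(safe)
    # Start from all nonempty safe cells (exponentially many) and prune.
    win = {
        frozenset(c)
        for k in range(1, len(safe) + 1)
        for c in combinations(sorted(safe), k)
    }
    changed = True
    while changed:
        changed = False
        for cell in list(win):
            if not any(
                successors(delta, observations, safe, a, cell) <= win
                for a in actions
            ):
                win.discard(cell)
                changed = True
    return {l for cell in win for l in cell}


def positive_safe_states(actions, delta, observations, safe):
    """Safe states from which an almost-sure winning state is reachable via safe states."""
    win = almost_sure_safe_states(actions, delta, observations, safe)
    changed = True
    while changed:
        changed = False
        for l in set(safe) - win:
            if any(post(delta, a, {l}) & win for a in actions):
                win.add(l)
                changed = True
    return win


# Assumed toy chain with a single action "a": states 0, 1, 2 are safe.
delta = {
    (0, "a"): {1: 0.5, 2: 0.5},
    (1, "a"): {1: 0.5, 3: 0.5},
    (2, "a"): {2: 1.0},
    (3, "a"): {3: 1.0},
}
observations = [frozenset({0, 1, 2}), frozenset({3})]
safe = {0, 1, 2}
```

On this assumed chain, only state 2 is almost-sure winning for safety, while states 0 and 2 are positive winning, even though the cell containing only state 0 is not winning in the subset construction.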

Algorithms. The complexity bound of Theorem 2 has been established previously
in [15], using an extension of the knowledge-based subset construction which is not
necessary (where the state space is $L \times 2^L$). Our proof is simpler and also yields efficient
and symbolic algorithms: an efficient antichain-based symbolic algorithm for almost-sure
winning for safety objectives can be obtained from [9], and positive reachability is
simple graph reachability.

The positive winning states for a coBüchi objective are computed as the set of states
from which an almost-sure winning state for safety can be reached with positive probability.

Theorem 3. Given a POMDP $G$ with a coBüchi objective and a starting state $\ell$, the
problem of deciding whether there exists a positive winning observation-based strategy
from $\ell$ can be solved in EXPTIME.

Proof. Let $\mathrm{coBüchi}(T)$ be a coBüchi objective in $G = (L, \Sigma, \delta, \mathcal{O})$. As in the proof
of Theorem 2, we compute in exponential time the set $T_W$ of almost-sure winning
states in $G$ for $\mathrm{Safe}(T)$, and using Lemma 1 the set $W$ of states from which Player 1
is positive winning for $\mathrm{Reach}(T_W)$. Clearly, all states in $W$ are positive winning for
$\mathrm{coBüchi}(T)$, and $W$ can be computed in EXPTIME. We argue that for all states $\ell \notin W$,
Player 1 is not positive winning for $\mathrm{coBüchi}(T)$ from $\ell$. Note that $\delta(\ell, \sigma)(\ell') = 0$ for
all $\ell \notin W$, $\ell' \in W$, and $\sigma \in \Sigma$, and thus there are no almost-sure winning states
for $\mathrm{Safe}(T)$ in $G$ reachable from $L \setminus W$ with positive probability, regardless of the
strategy of Player 1. Therefore, by an argument similar to the proof of Lemma 1, for
all observation-based strategies for Player 1, from every state $\ell \notin W$, the set $L \setminus T$
is reached with probability 1 and the event $\mathrm{Büchi}(L \setminus T)$ has probability 1. The result
follows. ■



4 Lower Bounds for the Qualitative Analysis of POMDPs 

In this section we present lower bounds for the qualitative analysis of POMDPs. We 
first present the lower bounds for MDPs with perfect observation. 

Lower bounds for MDPs with perfect observations. In the previous section we ar- 
gued that for reachability objectives even in POMDPs the positive winning problem 
is NLOGSPACE-complete. For safety objectives and almost-sure winning it is known 
that an MDP can be equivalently considered as a game where Player 2 makes choices 
of the successors from the support of the probability distribution of the transition func- 
tion, and the almost-sure winning set is the same in the MDP and the game. Similarly, 
there is a reduction of games of perfect observations to MDPs of perfect observation 
for almost-sure winning with safety objectives. The problem of almost-sure winning in 
games of perfect observation is alternating reachability and is PTIME-complete [2, 16].
It follows that almost-sure winning for safety objectives in MDPs is PTIME-complete.
We now show that the almost-sure winning problem for reachability and the positive
winning problem for safety objectives are PTIME-complete for MDPs with perfect
observation.

Reduction from the Circuit-Value-Problem. Let $N = \{1, 2, \ldots, n\}$ be a set of
AND and OR gates, and $I$ be a set of inputs. The set of inputs is partitioned into $I_0$
and $I_1$; $I_0$ is the set of inputs set to 0 (false) and $I_1$ is the set of inputs set to 1 (true).
Every gate receives two inputs and produces one output; the inputs of a gate are outputs
of other gates or inputs from the set $I$. The connection graph of the circuit must
be acyclic. Let the gate represented by the node 1 be the output node. The
CIRCUIT-VALUE-PROBLEM (CVP) is to decide whether the output is 1 or 0. This problem is
PTIME-complete. We present a reduction of CVP to MDPs with perfect observation
for almost-sure winning with reachability, and positive winning with safety objectives.

1. Almost-sure reachability. Given the CVP, we construct the MDP of perfect
observation as follows: (a) the set of states is $N \cup I$; (b) the action set is $\Sigma = \{l, r\}$;
(c) the transition function is as follows: every node in $I$ is absorbing, and for a state
that represents a gate, (i) if it is an OR gate, then for the action $l$ the left input gate
is chosen with probability 1, and for the action $r$ the right input gate is chosen with
probability 1; and (ii) if it is an AND gate, then irrespective of the action, the left
and right input gate are chosen with probability 1/2. The output of the CVP from
node 1 is 1 iff the set $I_1$ is reached from the state 1 in the MDP with probability 1
(i.e., the state 1 is almost-sure winning for the reachability objective $\mathrm{Reach}(I_1)$).

2. Positive safety. For positive winning with safety objectives, we take the CVP, apply
the same reduction as for almost-sure reachability with the following modifications:
every state in $I_0$ remains absorbing and from every state in $I_1$ the next state is the
starting state 1 with probability 1 irrespective of the action. The set of safety target
is the set $I_1 \cup N$. If the output of the CVP is 1, then from the starting
state the set $I_1$ is reached with probability 1, and hence the safety objective with
the target $N \cup I_1$ is ensured with probability 1. If the output of the CVP
is 0, then from the starting state the set $I_0$ is reached with positive probability $\eta > 0$
in $n$ steps against all strategies. Since from every state in $I_1$ the successor state is
the state 1, it follows that the probability to reach $I_0$ from the starting state 1 in
$k \cdot (n + 1)$ steps is at least $1 - (1 - \eta)^k$, and this goes to 1 as $k$ goes to $\infty$. Hence it
follows that from state 1, the answer to the positive winning for the safety objective
$\mathrm{Safe}(N \cup I_1)$ is YES iff the output of the CVP is 1.
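The first reduction (almost-sure reachability) can be sketched as follows. The sketch builds the MDP from a circuit, computes the almost-sure winning states for reachability with a standard nested fixpoint for perfect-observation MDPs, and compares the result with a direct evaluation of the circuit. The circuit encoding, the function names, and the example circuit are assumptions of this illustration, not part of the paper.

```python
def circuit_to_mdp(gates, inputs):
    """gates: {gate: (kind, left, right)} with kind "AND" or "OR"; inputs: input names.

    Returns delta[(state, action)] = {successor: probability}, actions "l" and "r".
    """
    delta = {}
    for x in inputs:
        for action in "lr":
            delta[(x, action)] = {x: 1.0}  # inputs are absorbing
    for gate, (kind, left, right) in gates.items():
        if kind == "OR":
            delta[(gate, "l")] = {left: 1.0}
            delta[(gate, "r")] = {right: 1.0}
        else:  # AND: both children with probability 1/2, whatever the action
            dist = {}
            for child in (left, right):
                dist[child] = dist.get(child, 0.0) + 0.5
            delta[(gate, "l")] = dict(dist)
            delta[(gate, "r")] = dict(dist)
    return delta


def almost_sure_reach(delta, target):
    """States from which some strategy reaches `target` with probability 1."""
    candidate = {state for state, _ in delta}
    actions = {action for _, action in delta}
    while True:
        # States that can reach the target while never leaving `candidate`.
        reach = set(target) & candidate
        changed = True
        while changed:
            changed = False
            for state in candidate - reach:
                for action in actions:
                    supp = {t for t, p in delta[(state, action)].items() if p > 0}
                    if supp <= candidate and supp & reach:
                        reach.add(state)
                        changed = True
                        break
        if reach == candidate:
            return candidate
        candidate = reach


def evaluate(gates, true_inputs, node):
    """Direct recursive evaluation of the (acyclic) circuit at `node`."""
    if node not in gates:
        return node in true_inputs
    kind, left, right = gates[node]
    lv = evaluate(gates, true_inputs, left)
    rv = evaluate(gates, true_inputs, right)
    return (lv or rv) if kind == "OR" else (lv and rv)


# Hypothetical circuit: input "x" is true, input "y" is false.
gates = {1: ("AND", 2, "x"), 2: ("OR", "x", "y"), 3: ("AND", "x", "y")}
true_inputs = {"x"}
delta = circuit_to_mdp(gates, ["x", "y"])
winning = almost_sure_reach(delta, true_inputs)
```

In this example the gates that are almost-sure winning for reaching the true inputs are exactly those that evaluate to 1, as the reduction predicts.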

From the above results it also follows that almost-sure and positive Büchi and coBüchi
objectives are PTIME-hard (and PTIME-completeness follows from the known
polynomial-time algorithms for qualitative analysis of MDPs with parity objectives [10, 11]).

Theorem 4. Given an MDP G of perfect observation, the following assertions hold: 
(a) the positive winning problem for reachability objectives is NLOGSPACE-complete, 
and the positive winning problem for safety, Büchi, coBüchi and parity objectives is
PTIME-complete; and (b) the almost-sure winning problem for reachability, safety,
Büchi, coBüchi and parity objectives is PTIME-complete.

Lower bounds for POMDPs. We have already shown that positive winning with reach- 
ability objectives in POMDPs is NLOGSPACE-complete. As in the case of MDPs with 
perfect observation, for safety objectives and almost-sure winning a POMDP can be 
equivalently considered as a game of partial observation where Player 2 makes choices 
of the successors from the support of the probability distribution of the transition func- 
tion, and the almost-sure winning set is the same in the POMDP and the game. Since 
the problem of almost-sure winning in games of partial observation with safety objec- 
tive is EXPTIME-complete [5], the EXPTIME-completeness result follows. We now 
show that almost-sure winning with reachability objectives and positive winning with 
safety objectives are EXPTIME-complete. Before the result we first present a discussion
on polynomial-space alternating Turing machines (ATM). 

Discussion. Let M be a polynomial-space ATM and let w be an input word. Then, 
there is an exponential bound on the number of configurations of the machine. Hence 
if $M$ can accept the word $w$, then it can do so within some $k_{|w|}$ steps, where $|w|$ is the
length of the word $w$, and $k_{|w|}$ is bounded by an exponential in $|w|$. We construct an
equivalent polynomial-space ATM $M'$ that behaves as $M$ but keeps track (in polynomial
space) of the number of steps executed by $M$, and given a word $w$, if the number
of steps reaches $k_{|w|}$ without accepting, then the word is rejected. The machine $M'$
is equivalent to M and reaches the accepting or rejecting states in a number of steps 
bounded by an exponential in the length of the input word. The problem of deciding, 
given a polynomial-space ATM M and a word w, whether M accepts w is EXPTIME- 
complete. 

Reduction from Alternating PSPACE Turing machine. Let M be a polynomial-space
ATM such that for every input word w, the accepting or the rejecting state
is reached within a number of steps exponential in $|w|$. A polynomial-time reduction
$R_G$ of a polynomial-space ATM M and an input word w to a game $G = R_G(M, w)$ of
partial observation is given in [9] such that (a) there is a special accepting state in G, and
(b) M accepts w iff there is an observation-based strategy for Player 1 in G to reach the
accepting state with probability 1. If the above reduction is applied to M, then the game
structure satisfies the following additional properties: there is a special rejecting state
that is absorbing, and for every observation-based strategy for Player 1, either (a) against
all Player 2 strategies the accepting state is reached with probability 1; or (b) there is a
pure Player 2 strategy that reaches the rejecting state with positive probability $\eta > 0$ in
$2^{|L|}$ steps, and the accepting or the rejecting state is reached with probability 1 in $2^{|L|}$
steps. We now present the reduction to POMDPs:

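1. Almost-sure winning for reachability. Given a polynomial-space ATM M and an
input word w, let $G = R_G(M, w)$. We construct a POMDP G' from G as follows: we
only modify the transition function in G' by choosing uniformly among the successor
choices. Formally, for a state $\ell \in L$ and an action $\sigma \in \Sigma$, the probabilistic transition
function $\delta'$ in G' is as follows: $\delta'(\ell, \sigma)(\ell') = 1/k$ if $\ell'$ is one of the $k$ successors
of $\ell$ under $\sigma$ in G, and $\delta'(\ell, \sigma)(\ell') = 0$ otherwise.
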
Given an observation-based strategy for Player 1 in G, we consider the same strategy
in G': (1) if the strategy reaches the accepting state with probability 1 against
all Player 2 strategies in G, then the strategy ensures that in G' the accepting state
is reached with probability 1; and (2) otherwise there is a pure Player 2 strategy $\beta$
in G that ensures the rejecting state is reached in $2^{|L|}$ steps with probability $\eta > 0$,
and with probability at least $(1/|L|)^{2^{|L|}}$ the successor choices of strategy
$\beta$ are chosen in G', and hence the rejecting state is reached with probability at least
$(1/|L|)^{2^{|L|}} \cdot \eta > 0$. It follows that in G' there is an observation-based strategy for
almost-sure winning the reachability objective with target the accepting state iff
there is such a strategy in G. The result follows.
2. Positive winning for safety. The reduction is the same as above. We obtain the POMDP
G'' from the POMDP G' above by making the following modification: from the
accepting state, the POMDP goes back to the initial state with probability 1. If
there is an observation-based strategy $\alpha$ for Player 1 in G'' to reach the accepting
state with probability 1, then by repeating the strategy $\alpha$ each time the accepting state
is visited, it can be ensured that the rejecting state is reached with probability 0.
Otherwise, against every observation-based strategy for Player 1, the probability to reach
the rejecting state in $k \cdot (2^{|L|} + 1)$ steps is at least $1 - (1 - \eta')^k$, where
$\eta' = \eta \cdot (1/|L|)^{2^{|L|}} > 0$ (this is because the rejecting state is reached with
probability at least $\eta'$ in $2^{|L|}$ steps, and unless the rejecting state is reached the
starting state is again reached within $2^{|L|} + 1$ steps). Hence the probability to reach the
rejecting state is 1. It follows that G' is almost-sure winning for the reachability objective
with target the accepting state iff in G'' there is an observation-based strategy for
Player 1 to ensure that the rejecting state is avoided with positive probability. This
completes the proof of correctness of the reduction.
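
The two ingredients of the reduction, uniform choice among successors and the bound $1 - (1 - \eta')^k$, can be illustrated with a minimal Python sketch. The dictionary encoding of the game, the toy state and action names, and the function names `uniformize` and `rejecting_lower_bound` are assumptions made for illustration only; they are not notation from the paper.

```python
from fractions import Fraction


def uniformize(game_delta):
    """Replace Player 2's successor choice by a uniform distribution.

    game_delta maps (state, action) to a non-empty set of successors
    (a hypothetical encoding of the game G). The result maps
    (state, action) to a dict {successor: probability}, as in G'.
    """
    pomdp_delta = {}
    for (state, action), successors in game_delta.items():
        weight = Fraction(1, len(successors))
        pomdp_delta[(state, action)] = {succ: weight for succ in successors}
    return pomdp_delta


def rejecting_lower_bound(eta_prime, k):
    """Lower bound 1 - (1 - eta')^k on reaching the rejecting state in k rounds."""
    return 1 - (1 - eta_prime) ** k


# Toy game; all names below are made up for illustration.
toy_game = {
    ("l0", "a"): {"l1", "l2"},
    ("l1", "a"): {"accept"},
    ("l2", "a"): {"reject"},
}
toy_pomdp = uniformize(toy_game)
```

Since $\eta' > 0$, the bound returned by `rejecting_lower_bound` tends to 1 as k grows, which is the step used to conclude that the rejecting state is reached with probability 1.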

A very brief (two-line) proof sketch was presented as the proof of Theorem 1 of [12]
to show that positive winning in POMDPs with safety objectives is EXPTIME-hard.
We were unable to reconstruct the proof: the proof suggested simulating a nondeterministic
Turing machine. The simulation of a polynomial-space nondeterministic Turing
machine only shows PSPACE-hardness, and the simulation of a nondeterministic
EXPTIME Turing machine would have shown NEXPTIME-hardness, whereas an EXPTIME
upper bound is known for the problem. Our proof is a different and detailed proof
of the result of Theorem 1 of [12]. Hence we have the following theorem, and the results
are summarized in Table 1.

Theorem 5. Given a POMDP G, the following assertions hold: (a) the positive winning
problem for reachability objectives is NLOGSPACE-complete, the positive winning
problem for safety and coBüchi objectives is EXPTIME-complete, and the positive
winning problem for Büchi and parity objectives is undecidable; and (b) the almost-sure
winning problem for reachability, safety and Büchi objectives is EXPTIME-complete,
and the almost-sure winning problem for coBüchi and parity objectives is undecidable.

Proof. The results are obtained as follows. 

1. Positive winning. The NLOGSPACE-completeness for positive winning with reachability
objectives is Theorem 1. Our reduction from Alternating PSPACE Turing
machine shows EXPTIME-hardness for positive winning with safety objectives (and hence
the lower bound also follows for coBüchi objectives), and the upper bounds follow
from Theorem 2 and Theorem 3. The undecidability of positive winning
for Büchi and parity objectives follows from the results of [1, 14].

2. Almost-sure winning. It follows from the results of [9, 1] that the decision problems
for almost-sure winning for POMDPs with reachability, safety, and Büchi objectives
can be solved in EXPTIME. Our reduction from Alternating PSPACE Turing
machine shows EXPTIME-hardness for almost-sure winning with reachability objectives
(and hence the lower bound also follows for Büchi objectives). The lower bound for
safety objectives follows from the lower bound for partial information games [9]
and the fact that almost-sure winning for safety coincides with almost-sure winning
in games. The undecidability of almost-sure winning for coBüchi and parity
objectives follows from the results of [1, 14].





               Positive                      Almost-sure
Reachability   NLOGSPACE-complete (up+lo)    EXPTIME-complete (lo)
Safety         EXPTIME-complete (up+lo)      EXPTIME-complete [5]
Büchi          Undecidable [1]               EXPTIME-complete (lo)
coBüchi        EXPTIME-complete (up+lo)      Undecidable [1]
Parity         Undecidable [1]               Undecidable [1]



Table 1. Computational complexity of POMDPs with different classes of parity objectives for
positive and almost-sure winning. Our contributions of upper and lower bounds are indicated by
"up" and "lo", respectively, in parentheses.



5 Optimal Memory Bounds for Strategies 



In this section we present optimal bounds on the memory required by pure and randomized
strategies for positive and almost-sure winning for reachability, safety, Büchi and
coBüchi objectives.

Bounds for safety objectives. First, we consider positive and almost-sure winning with 
safety objectives in POMDPs. It follows from the correctness argument of Theorem 2 
that pure strategies with exponential memory are sufficient for positive winning with 
safety objectives in POMDPs, and the exponential upper bound on memory of pure 
strategies for almost-sure winning with safety objectives in POMDPs follows from the 
reduction to games. We now present a matching exponential lower bound for random- 
ized strategies. 

Lemma 3. There exists a family $(P_n)_{n \in \mathbb{N}}$ of POMDPs of size $O(p(n))$ for a
polynomial p with a safety objective such that the following assertions hold: (a) Player 1
has a (pure) almost-sure (and therefore also positive) winning strategy in each of these
POMDPs; and (b) there exists a polynomial q such that every finite-memory randomized
strategy for Player 1 that is positive (or almost-sure) winning in $P_n$ has at least
$2^{q(n)}$ states.

Preliminary. The set of actions of the POMDP $P_n$ is $\Sigma_n \cup \{\#\}$ where
$\Sigma_n = \{1, \ldots, n\}$. The POMDP is composed of an initial state $q_0$ and n sub-MDPs $A_i$ with
state space $Q_i$, each consisting of a loop over $p_i$ states $q^i_1, \ldots, q^i_{p_i}$, where $p_i$ is the i-th
prime number. From each state $q^i_j$ ($1 \le j < p_i$), every action in $\Sigma_n$ leads to the next
state $q^i_{j+1}$ with probability 1/2, and to the initial state $q_0$ with probability 1/2. The action
# is not allowed. From $q^i_{p_i}$, the action i is not allowed while the other actions in $\Sigma_n$
lead back to the first state $q^i_1$ and to the initial state $q_0$, both with probability 1/2. Moreover,
the action # leads back to the initial state (with probability 1). The disallowed actions
lead to a bad state. The states of the $A_i$'s are indistinguishable (they have the same
observation), while the initial state $q_0$ is visible. We assume that the state spaces $Q_i$ of the
$A_i$'s are disjoint.

POMDP family $(P_n)_{n \in \mathbb{N}}$. The state space of $P_n$ is the disjoint union of $Q_1, \ldots, Q_n$
and $\{q_0, \mathit{Bad}\}$. The initial state is $q_0$, the final state is Bad. The probabilistic transition
function is as follows:

- for all $1 \le i \le n$ and $\sigma \in \Sigma_n$, we have $\delta(q_0, \sigma)(q^i_1) = 1/n$;

- for all $1 \le i \le n$, $1 \le j < p_i$, and $\sigma \in \Sigma_n$, $\sigma' \in \Sigma_n \setminus \{i\}$, we have
$\delta(q^i_j, \sigma)(q^i_{j+1}) = \delta(q^i_j, \sigma)(q_0) = \delta(q^i_{p_i}, \sigma')(q^i_1) = \delta(q^i_{p_i}, \sigma')(q_0) = 1/2$; and

- for all $1 \le i \le n$ and $1 \le j < p_i$, we have
$\delta(q_0, \#)(\mathit{Bad}) = \delta(q^i_j, \#)(\mathit{Bad}) = \delta(q^i_{p_i}, \#)(q_0) = 1$.

The initial state is $q_0$. There are two observations: the state $\{q_0\}$ is labelled by
observation $o_1$, and the other states in $Q_1 \cup \cdots \cup Q_n$ (that we call the loops) by observation $o_2$.
Fig. 2 shows the POMDP $P_2$. The witness family of POMDPs has similarities with analogous
constructions for games [4]. However, the construction of [4] shows lower bounds
only for pure strategies and in games, whereas we present lower bounds for randomized
strategies and for POMDPs, and hence our proofs are very different.
Fig. 2. The POMDP $P_2$.    Fig. 3. An instance of the POMDP $P'_n$.



Proof of Lemma 3. After the first transition from the initial state, Player 1 has the
following positive winning strategy. Let $p^*_n = \prod_{i=1}^{n} p_i$. While the POMDP is in the
loops (assume that we have seen observation $o_2$ consecutively j times), if $1 \le j < p^*_n$,
then play any action i such that $j \bmod p_i \neq 0$ (this is well defined since $p^*_n$ is the lcm
of $p_1, \ldots, p_n$), and otherwise play #. It is easy to show that this strategy is winning for
the safety condition, with probability 1.
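
The counting strategy above can be checked exhaustively for small n. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and it assumes that after j consecutive observations of $o_2$ the loop over $p_i$ states is in its last state exactly when $p_i$ divides j (the counter restarts whenever $q_0$ is observed).

```python
from math import prod


def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def lemma3_action(j, primes):
    """Action played after j consecutive observations of o_2 (1 <= j <= p*)."""
    if j < prod(primes):
        # Some p_i does not divide j, because lcm(p_1, ..., p_n) = p* > j.
        return next(i for i, p in enumerate(primes, start=1) if j % p != 0)
    return "#"


def allowed(i, j, action, primes):
    """Is `action` allowed in loop i after j consecutive observations of o_2?"""
    at_last_state = j % primes[i - 1] == 0
    if action == "#":
        return at_last_state  # '#' is allowed only in the last state of a loop
    return not (at_last_state and action == i)  # action i is forbidden there


def strategy_is_safe(n):
    """Check that no disallowed action is ever played, whichever loop is active."""
    primes = first_primes(n)
    return all(
        allowed(i, j, lemma3_action(j, primes), primes)
        for j in range(1, prod(primes) + 1)
        for i in range(1, n + 1)
    )
```

The check ranges over every loop index i because the strategy cannot observe which loop the POMDP is in; it must play an action that is allowed in all of them.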

For the second part of the result, assume towards a contradiction that there exists a
finite-memory randomized strategy $\alpha$ that is positive winning for Player 1 and has fewer
than $p^*_n$ states (since $p^*_n$ is exponential in $s^*_n = \sum_{i=1}^{n} p_i$, the result will follow). Let $\eta$ be
the least positive transition probability described by the finite-state strategy $\alpha$. Consider
any history of a play $\rho$ that ends with $o_1$. We claim that the following properties hold:
(a) with probability 1 either observation $o_1$ is visited again from $\rho$ or the state Bad is
reached; and (b) the state Bad is reached with a positive probability. The first property
(property (a)) follows from the fact that for all actions the loops are left (the state $q_0$
or Bad is reached) with probability at least 1/2. We now prove the second property by
showing that the state Bad is reached with probability at least A n = i • — 1 « . To 
see this, consider the sequence of actions played by strategy $\alpha$ after $\rho$ when only $o_2$ is
observed. Either # is never played, and then the action played by $\alpha$ after a sequence
of $p^*_n$ states leads to Bad (the current state being then $q^i_{p_i}$ for some $1 \le i \le n$). This
occurs with probability at least $A_n$; or # is eventually played, but since $\alpha$ has fewer than
$p^*_n$ states, it has to be played after fewer than $p^*_n$ steps, which also leads to Bad with
probability at least $A_n$. The above two properties, namely that (a) $o_1$ or Bad is reached with
probability 1 from $o_1$, and (b) within $p^*_n$ steps after a visit to $o_1$, the state Bad is reached
with a fixed positive probability, ensure that Bad is reached with probability 1. Hence
$\alpha$ is not positive winning. It follows that randomized strategies that are almost-sure or
positive winning in POMDPs with safety objectives may require exponential memory.
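
To give a feel for the gap between the size of the POMDP and the memory bound, the short Python sketch below tabulates the sum $s^*_n$ of the first n primes (which governs the number of loop states) against their product $p^*_n$ (the memory lower bound). The helper names are hypothetical and the script is only a numerical illustration, not part of the proof.

```python
from math import prod


def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def memory_gap(n):
    """Return (s*_n, p*_n): the sum and the product of the first n primes."""
    primes = first_primes(n)
    return sum(primes), prod(primes)


if __name__ == "__main__":
    for n in range(1, 9):
        s_star, p_star = memory_gap(n)
        print(f"n={n}: loop states s*={s_star}, memory bound p*={p_star}")
```

Since every prime is at least 2, the product satisfies $p^*_n \ge 2^n$, while the sum grows only polynomially in n.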

Bounds for reachability objectives. We now argue the memory bounds for pure and 
randomized strategies for positive winning with reachability objectives. 



1. It follows from the correctness argument of Theorem 1 that randomized memory- 
less strategies suffice for positive winning with reachability objectives in POMDPs. 

2. We now argue that for pure strategies, memory of size linear in the number of states
is sufficient and may be necessary. The upper bound follows from the reduction to
graph reachability. Given a POMDP G, consider the graph constructed from G
as in the correctness argument for Theorem 1. Given the starting state $\ell$, if there
is a path in this graph to the target set obtained from T, then there is a path $\pi$ of length
at most $|L|$. The pure strategy for Player 1 in G can play the sequence of actions
of the path $\pi$ to ensure that the target observations T are reached with positive
probability in G. The family of examples to show that pure strategies require linear
memory can be constructed as follows: we construct a POMDP with deterministic
transition function such that there is a unique path (sequence of actions) of length
$O(|L|)$ to the target, and any deviation leads to an absorbing state, and other than
the target state every other state has the same observation. In this POMDP any
pure strategy must remember the exact sequence of actions to be played and hence
requires $O(|L|)$ memory.
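
The upper bound for pure strategies can be sketched as a breadth-first search. The Python code below is a minimal illustration under assumptions: the graph is taken to have an edge from a state to each successor reachable with positive probability under some action (the exact graph used for Theorem 1 is not restated here), and the encoding and function name are hypothetical.

```python
from collections import deque


def action_sequence_to_target(delta, start, targets):
    """Return a shortest action sequence along a positive-probability path.

    delta maps (state, action) to a dict {successor: probability}.
    Returns a list of actions, or None if no target state is reachable.
    """
    parent = {start: None}  # state -> (predecessor, action) on a shortest path
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in targets:
            actions = []
            while parent[state] is not None:
                state, action = parent[state]
                actions.append(action)
            return actions[::-1]
        for (source, action), distribution in delta.items():
            if source != state:
                continue
            for successor, probability in distribution.items():
                if probability > 0 and successor not in parent:
                    parent[successor] = (state, action)
                    queue.append(successor)
    return None


# Toy POMDP; state and action names are made up for illustration.
toy_delta = {
    ("s", "a"): {"s": 0.5, "t": 0.5},
    ("s", "b"): {"sink": 1.0},
    ("t", "b"): {"goal": 1.0},
    ("t", "a"): {"sink": 1.0},
}
plan = action_sequence_to_target(toy_delta, "s", {"goal"})
```

A pure strategy that plays the returned sequence blindly, ignoring observations, follows the path with positive probability, and a shortest path visits each state at most once, so the sequence has length less than the number of states.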

It follows from the results of [1] that for almost-sure winning with reachability objec- 
tives in POMDPs pure strategies with exponential memory suffice, and we now prove 
an exponential lower bound for randomized strategies. 

Lemma 4. There exists a family $(P_n)_{n \in \mathbb{N}}$ of POMDPs of size $O(p(n))$ for a polynomial
p with a reachability objective such that the following assertions hold: (a) Player 1
has an almost-sure winning strategy in each of these POMDPs; and (b) there exists a
polynomial q such that every finite-memory randomized strategy for Player 1 that is
almost-sure winning in $P_n$ has at least $2^{q(n)}$ states.

Fix the action set as $\Sigma = \{\#, \mathit{tick}\}$. The POMDP $P'_n$ is composed of an initial
state $q_0$ and n sub-MDPs $H_i$, each consisting of a loop over $p_i$ states $q^i_1, \ldots, q^i_{p_i}$, where
$p_i$ is the i-th prime number. From each state in the loops, the action tick can be played
and leads to the next state in the loop (with probability 1). The action # can be played
in the last state of each loop and leads to the Goal state. The objective is to reach Goal
with probability 1. Actions that are not allowed lead to a sink state from which it is
impossible to reach Goal. There is a unique observation that consists of the whole state
space. Fig. 3 shows an instance of $P'_n$.

Proof of Lemma 4. First we show that Player 1 has an almost-sure winning strategy
in $P'_n$ (from $q_0$). As there is only one observation, a strategy for Player 1 corresponds
to a function $\alpha : \mathbb{N} \to \Sigma$. Consider the strategy $\alpha^*$ as follows: $\alpha^*(j) = \mathit{tick}$ for all
$0 \le j < p^*_n$ and $\alpha^*(j) = \#$ for all $j \ge p^*_n$. It is easy to check that $\alpha^*$ ensures winning
with certainty and hence almost-sure winning.

For the second part of the result assume, towards a contradiction, that there exists a
finite-memory randomized strategy $\alpha$ that is almost-sure winning and has fewer than $p^*_n$
states. Clearly, $\alpha$ cannot play # before the $(p^*_n + 1)$-th round since one of the sub-MDPs
$H_i$ would not be in $q^i_{p_i}$ and therefore Player 1 would lose with probability at least $1/n$.
Note that the state reached by the strategy automaton defining $\alpha$ after $p^*_n$ rounds has
necessarily been visited in a previous round. Since $\alpha$ has to play # eventually to reach
Goal, this means that # must have been played in some round $j < p^*_n$, when at least one
of the sub-MDPs $H_i$ was not in location $q^i_{p_i}$, so that Player 1 would have already lost
with probability at least $\frac{1}{n} \cdot \eta$, where $\eta$ is the least positive probability specified by $\alpha$.
This is in contradiction with our assumption that $\alpha$ is an almost-sure winning strategy.
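
The timing constraint behind this proof can be checked numerically. The Python sketch below assumes (this is not spelled out in the text above) that the first action moves from $q_0$ to the first state of some loop, so that after j actions the loop over $p_i$ states is in state number $((j - 1) \bmod p_i) + 1$. Function names are hypothetical.

```python
from math import prod


def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def all_loops_in_last_state(j, primes):
    """True iff, after j >= 1 actions (all ticks), every loop is in its last state.

    Assumption: loop i is in state number ((j - 1) mod p_i) + 1 after j actions.
    """
    return j >= 1 and all((j - 1) % p + 1 == p for p in primes)


def first_safe_round_for_sharp(n):
    """Smallest number of ticks after which '#' reaches Goal from every loop."""
    primes = first_primes(n)
    j = 1
    while not all_loops_in_last_state(j, primes):
        j += 1
    return j
```

Under this assumption, the first round at which # is safe in every loop is the product of the first n primes, so a strategy that plays # with positive probability any earlier loses with positive probability.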

Bounds for Büchi and coBüchi objectives. An exponential upper bound for memory
of pure strategies for almost-sure winning of Büchi objectives follows from the results
of [1], and the matching lower bound for randomized strategies follows from our result
for reachability objectives. Since positive winning is undecidable for Büchi objectives,
there is no bound on memory for pure or randomized strategies for positive winning. An
exponential upper bound for memory of pure strategies for positive winning of coBüchi
objectives follows from the correctness proof of Theorem 3 that iteratively combines
the positive winning strategies for safety and reachability to obtain a positive winning
strategy for coBüchi objectives. The matching lower bound for randomized strategies
follows from our result for safety objectives. Since almost-sure winning is undecidable
for coBüchi objectives, there is no bound on memory for pure or randomized strategies
for almost-sure winning. This gives us the following theorem (also summarized in Table 2),
which is in contrast to the results for MDPs with perfect observation where pure
memoryless strategies suffice for almost-sure and positive winning for all parity objectives.

Theorem 6. The optimal memory bounds for strategies in POMDPs are as follows. 

1. Reachability objectives: for positive winning randomized memoryless strategies are 
sufficient, and linear memory is necessary and sufficient for pure strategies; and for 
almost-sure winning exponential memory is necessary and sufficient for both pure 
and randomized strategies. 

2. Safety objectives: for positive winning and almost-sure winning exponential mem- 
ory is necessary and sufficient for both pure and randomized strategies. 

3. Büchi objectives: for almost-sure winning exponential memory is necessary and
sufficient for both pure and randomized strategies; and there is no bound on memory
for pure and randomized strategies for positive winning.

4. coBüchi objectives: for positive winning exponential memory is necessary and
sufficient for both pure and randomized strategies; and there is no bound on memory
for pure and randomized strategies for almost-sure winning.





               Pure Positive   Randomized Positive   Pure Almost   Randomized Almost
Reachability   Linear          Memoryless            Exponential   Exponential
Safety         Exponential     Exponential           Exponential   Exponential
Büchi          No Bound        No Bound              Exponential   Exponential
coBüchi        Exponential     Exponential           No Bound      No Bound
Parity         No Bound        No Bound              No Bound      No Bound



Table 2. Optimal memory bounds for pure and randomized strategies for positive and almost-sure 
winning. 